Imports required to unpack the datasets and load the DDS (Doing Data Science) datasets from a set of CSV files


In [ ]:
import os
import zipfile
from metrique.core_api import PandasClient

def xall(path):
    """Extract every member of the zip archive at `path` into the current directory."""
    with zipfile.ZipFile(os.path.expanduser(path)) as z:
        z.extractall()
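
`xall` extracts into whatever the current working directory happens to be. A variant that takes an explicit destination can make the unpacking step less dependent on where the notebook is running. This is only a minimal sketch: `extract_to` and its `dest` parameter are hypothetical names, not part of metrique.

```python
import os
import zipfile


def extract_to(path, dest="."):
    """Extract every member of the zip at `path` into `dest`; return the member names."""
    path = os.path.expanduser(path)
    # The context manager closes the archive even if extraction fails.
    with zipfile.ZipFile(path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Returning the member names makes it easy to see what an archive contained, for example `extract_to('doing_data_science/dds_datasets.zip')`.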

In [ ]:
#!mkdir ~/.metrique/repos

In [ ]:
%cd ~/.metrique/repos

Clone the metrique git repo; install metrique


In [ ]:
!git clone https://github.com/drpoovilleorg/metrique.git

Clone the O'Reilly Doing Data Science sample dataset git repo, then unpack the datasets so they can be loaded


In [ ]:
!git clone https://github.com/oreillymedia/doing_data_science.git

In [ ]:
if not os.path.exists('nyt1.csv'):
    xall('doing_data_science/dds_datasets.zip')  # extracts various doing data science datasets
    xall('dds_datasets/dds_ch2_nyt.zip')  # extracts the nyt*.csv's

In [ ]:
z = PandasClient()

Load the datasets


In [25]:
# Globs are accepted; a single ? matches exactly one character, so only
# files such as nyt1.csv (not nyt10.csv) are loaded; takes 10s+
%time nyt = z.load('./nyt?.csv')


CPU times: user 10.1 s, sys: 1.48 s, total: 11.6 s
Wall time: 11.7 s
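
The difference between `?` and `*` in the glob above can be checked without touching the data. The sketch below uses the standard library's `fnmatch` on a made-up list of file names, and assumes that metrique's `load` follows the usual shell-style glob rules (the notebook does not confirm this).

```python
import fnmatch

# Hypothetical file names, standing in for the extracted nyt*.csv files.
names = ["nyt1.csv", "nyt9.csv", "nyt10.csv", "nyt31.csv"]

# "?" matches exactly one character, so two-digit suffixes are skipped.
single = fnmatch.filter(names, "nyt?.csv")

# "*" matches any run of characters, so every file is selected.
everything = fnmatch.filter(names, "nyt*.csv")

print(single)
print(everything)
```

Under that assumption, passing `'./nyt*.csv'` instead would select every NYT file, at a correspondingly higher load time.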

In [26]:
%time ch5_binary = z.load('./dds_datasets/dds_ch5_binary-class-dataset.txt', sep='\t')


CPU times: user 103 ms, sys: 0 ns, total: 103 ms
Wall time: 104 ms
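
The `sep='\t'` argument marks the file as tab-delimited. As a rough illustration of what that means, the sketch below parses a made-up two-row sample with the standard library's `csv` module. The sample values and the `label` column are invented for the example; only `last_sv` appears in this notebook.

```python
import csv
import io

# Made-up sample standing in for a tab-delimited file; values are illustrative only.
sample = io.StringIO("last_sv\tlabel\n3\t0\n7\t1\n")

# delimiter="\t" splits each line on tab characters instead of commas.
reader = csv.reader(sample, delimiter="\t")
header = next(reader)   # first line holds the column names
rows = list(reader)     # remaining lines are the data rows, as strings

print(header)
print(rows)
```

With the default comma delimiter, each line would instead come back as a single field containing the embedded tabs.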

Run the pandas analysis


In [27]:
nyt.Impressions.hist()


Out[27]:
<matplotlib.axes.AxesSubplot at 0xa5818d0>

In [28]:
ch5_binary.last_sv.hist()


Out[28]:
<matplotlib.axes.AxesSubplot at 0xa9a4ad0>
